28 research outputs found
Efficient two-sample functional estimation and the super-oracle phenomenon
We consider the estimation of two-sample integral functionals, of the type
that occur naturally, for example, when the object of interest is a divergence
between unknown probability densities. Our first main result is that, in wide
generality, a weighted nearest neighbour estimator is efficient, in the sense
of achieving the local asymptotic minimax lower bound. Moreover, we also prove
a corresponding central limit theorem, which facilitates the construction of
asymptotically valid confidence intervals for the functional, having
asymptotically minimal width. One interesting consequence of our results is the
discovery that, for certain functionals, the worst-case performance of our
estimator may improve on that of the natural `oracle' estimator, which is given
access to the values of the unknown densities at the observations. Comment: 82 pages
USP: an independence test that improves on Pearson's chi-squared and the G-test.
We present the U-statistic permutation (USP) test of independence in the context of discrete data displayed in a contingency table. Either Pearson's χ²-test of independence or the G-test is typically used for this task, but we argue that these tests have serious deficiencies, both in terms of their inability to control the size of the test, and their power properties. By contrast, the USP test is guaranteed to control the size of the test at the nominal level for all sample sizes, has no issues with small (or zero) cell counts, and is able to detect distributions that violate independence in only a minimal way. The test statistic is derived from a U-statistic estimator of a natural population measure of dependence, and we prove that this is the unique minimum variance unbiased estimator of this population quantity. The practical utility of the USP test is demonstrated on both simulated data, where its power can be dramatically greater than those of Pearson's test, the G-test and Fisher's exact test, and on real data. The USP test is implemented in the R package USP.
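The permutation principle behind this kind of test can be illustrated with a minimal sketch. It is not the USP test itself: the statistic below is a simple plug-in estimate of the sum over cells of (p_ij - p_i. p_.j)^2 rather than the paper's U-statistic estimator, and the function names `dependence_stat` and `permutation_pvalue` are hypothetical. What it shares with the abstract's description is the permutation calibration, which is what gives size control at any sample size.

```python
import random
from collections import Counter


def dependence_stat(xs, ys):
    """Plug-in estimate of sum_{i,j} (p_ij - p_i. * p_.j)^2.

    Assumption: this is a stand-in statistic, not the U-statistic
    estimator used by the USP test.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row = Counter(xs)             # row margins
    col = Counter(ys)             # column margins
    return sum(
        (joint[(a, b)] / n - row[a] * col[b] / n**2) ** 2
        for a in row
        for b in col
    )


def permutation_pvalue(xs, ys, n_perm=999, seed=0):
    """Permutation p-value: shuffle ys to break any dependence on xs."""
    rng = random.Random(seed)
    observed = dependence_stat(xs, ys)
    ys_perm = list(ys)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(ys_perm)
        if dependence_stat(xs, ys_perm) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the test valid
    # (p-value is never zero) for any sample size.
    return (1 + exceed) / (1 + n_perm)
```

Rejecting when the returned p-value is at most the nominal level gives a test whose size does not rely on large-sample approximations, so small or zero cell counts cause no difficulty for the calibration.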
Efficient multivariate entropy estimation via k-nearest neighbour distances
Many statistical procedures, including goodness-of-fit tests and methods for
independent component analysis, rely critically on the estimation of the
entropy of a distribution. In this paper, we seek entropy estimators that are
efficient and achieve the local asymptotic minimax lower bound with respect to
squared error loss. To this end, we study weighted averages of the estimators
originally proposed by Kozachenko and Leonenko (1987), based on the $k$-nearest
neighbour distances of a sample of $n$ independent and identically distributed
random vectors in $\mathbb{R}^d$. A careful choice of weights enables us to
obtain an efficient estimator in arbitrary dimensions, given sufficient
smoothness, while the original unweighted estimator is typically only efficient
when $d \leq 3$. In addition to the new estimator proposed and theoretical
understanding provided, our results facilitate the construction of
asymptotically valid confidence intervals for the entropy of asymptotically
minimal width.
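As a rough illustration of the building block mentioned in this abstract, the following sketch computes the original unweighted Kozachenko-Leonenko estimator, in the common form (1/n) Σ_i log(ρ_i^d V_d (n-1)) - Ψ(k), where ρ_i is the distance from the i-th point to its k-th nearest neighbour, V_d is the volume of the unit ball in d dimensions and Ψ is the digamma function. It does not implement the weighted estimator proposed in the paper; the function names are hypothetical, and the brute-force neighbour search is chosen only for brevity.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Digamma function at a positive integer k: -gamma + sum_{j<k} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def kl_entropy(points, k=1):
    """Unweighted Kozachenko-Leonenko entropy estimate (in nats).

    points: list of equal-length coordinate lists, assumed distinct.
    Brute-force O(n^2 log n) neighbour search, for illustration only.
    """
    n = len(points)
    d = len(points[0])
    # Volume of the d-dimensional Euclidean unit ball.
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(1 + d / 2)
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(
            math.dist(x, y) for j, y in enumerate(points) if j != i
        )
        rho = dists[k - 1]  # distance to the k-th nearest neighbour
        total += math.log(rho**d * unit_ball_volume * (n - 1)) - digamma_int(k)
    return total / n
```

For a sample from a standard normal distribution in one dimension, the estimate should be close to the true entropy 0.5 log(2πe), with accuracy depending on n and k.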
Optimal rates for independence testing via U-statistic permutation tests
We study the problem of independence testing given independent and
identically distributed pairs taking values in a $\sigma$-finite, separable
measure space. Defining a natural measure of dependence $D(f)$ as the squared
$L^2$-distance between a joint density $f$ and the product of its marginals, we
first show that there is no valid test of independence that is uniformly
consistent against alternatives of the form $\{f : D(f) \geq \rho^2\}$. We
therefore restrict attention to alternatives that impose additional
Sobolev-type smoothness constraints, and define a permutation test based on a
basis expansion and a $U$-statistic estimator of $D(f)$ that we prove is
minimax optimal in terms of its separation rates in many instances. Finally,
for the case of a Fourier basis on $[0,1]^2$, we provide an approximation to
the power function that offers several additional insights. Our methodology is
implemented in the R package USP. Comment: 58 pages, 4 figures
Discussion of 'Multivariate Fisher's independence test for multivariate dependence'
Invited discussion for Biometrika of 'Multivariate Fisher's independence test
for multivariate dependence' by Gorsky and Ma (2022). Comment: 4 pages
Efficient functional estimation and the super-oracle phenomenon
We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural ‘oracle’ estimator, which itself can be optimal in the related problem where the data consist of the values of the unknown densities at the observations.
Optimal nonparametric testing of Missing Completely At Random, and its connections to compatibility
Given a set of incomplete observations, we study the nonparametric problem of testing whether data are Missing Completely At Random (MCAR). Our first contribution is to characterise precisely the set of alternatives that can be distinguished from the MCAR null hypothesis. This reveals interesting and novel links to the theory of Fréchet classes (in particular, compatible distributions) and linear programming, that allow us to propose MCAR tests that are consistent against all detectable alternatives. We define an incompatibility index as a natural measure of ease of detectability, establish its key properties, and show how it can be computed exactly in some cases and bounded in others. Moreover, we prove that our tests can attain the minimax separation rate according to this measure, up to logarithmic factors. Our methodology does not require any complete cases to be effective, and is available in the R package MCARtest.